Using DST's API from Python

data tools
Learn how to extract data from Statistics Denmark's Statistikbanken with Python
Published: February 4, 2022
Modified: December 6, 2024

I am a huge fan of Statistics Denmark (Danmarks Statistik, or DST). Their Statistikbanken contains a wealth of data on Danish society, the economy, and the population.

Not only are all these data publicly available, but DST has for years also provided access to all its published data tables through an API, documented here. The API makes it extremely easy to access and use the data. Yet unless one already has some experience with APIs, using it can be daunting for an occasional student or analyst.

This notebook provides a quick guide on how to access data from DST's Statistikbanken through their API, and presents a utility class I wrote to make it easier to pull data tables for analytical purposes.

The only explicit dependency of that utility is pandas, which is in any case an extremely widespread package.

Both notebook and class can be found at this GitHub repository.

The utility can be installed with

pip install git+https://github.com/alemartinello/dstapi.git
# Start by importing necessary packages
import requests
import pandas as pd
from IPython.display import display
from io import StringIO

from dstapi import DstApi # The helper class

DST's API has four separate function calls for programmatically navigating the published tables. This guide assumes that the analyst has already scouted Statistikbanken and identified the one or two tables from which data should be extracted. For these purposes, we only need two function calls: tableinfo and data.
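For completeness, the other two calls, subjects and tables, let you browse DST's catalogue of subjects and tables programmatically. Below is a minimal sketch of how they could be called with requests; the parameter and field names (subjects, id, description) reflect my reading of the API documentation, so double-check them there before relying on this.

# Sketch: browse the catalogue through the subjects and tables endpoints
# List the top-level subject areas
subjects = requests.post(
    'https://api.statbank.dk/v1/subjects', json={'format': 'JSON', 'lang': 'en'}
).json()
print([(s['id'], s['description']) for s in subjects][:5])

# List the tables available under the first subject area
tables = requests.post(
    'https://api.statbank.dk/v1/tables',
    json={'subjects': [subjects[0]['id']], 'format': 'JSON', 'lang': 'en'}
).json()
print([t['id'] for t in tables][:10])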

The standard process is to begin by obtaining the necessary information from tableinfo, and then construct the call to pass to data.

For each step of the process, this guide first shows how to do it directly with requests (and pandas), and then how the utility class DstApi can facilitate it.

Step 1: Understand what a table has to offer and how it is structured

Our primary example will be DST's table METROX1, which reports an index measuring the weekly number of passengers travelling by metro in Copenhagen. This index was developed to measure the population's response to the COVID-19 pandemic. The table is small and simple, allowing for quick experimentation.

The hard way

As we know the table's name/id, we can start by accessing the API directly through the Python package requests and asking for the table's metadata (tableinfo).

An API call is composed of a main web address, a function call, and a set of parameters. The main web address is https://api.statbank.dk/v1. The function call in this case is tableinfo. The necessary parameters, per the documentation, are the id of the table and the format in which we'd like to receive the information. We'll pick "metrox1" for the first (note that the table-id parameter is not case-sensitive), and "JSON" for the second.

The DST API can be called through both requests.get() and requests.post(). DST's documentation recommends using post because, as the number and complexity of parameters grows (with some of them containing non-standard Danish characters), it becomes harder to embed them in a URL. However, as the call to tableinfo is simple, below I provide examples of both methods.

Note that the .json() method of the requests.Response object returns the response content (which we requested in JSON format) rather than the object itself. That's just to print the output in the notebook.

This function returns a wealth of information: not just the table id and description, but also the contact details of the person responsible for the statistics and, crucially, the names and values of the variables defining the table, in this case SÆSON and Tid.

The code below shows how to get the table’s metadata, and prints the beginning of the JSON file returned.

Show the code
# Directly embed parameters in the URL with requests.get()
requests.get('https://api.statbank.dk/v1' + '/tableinfo' + "?id=metrox1&format=JSON").json()

# Pass a dictionary of parameters to requests.get()
params = {'id': 'metrox1', 'format': 'JSON'}
requests.get('https://api.statbank.dk/v1' + '/tableinfo', params=params).json()

# Use requests.post() - note the change in the name of the parameter for the table's name
# I'm also adding here a language parameter - most tables are available in both Danish and English
params = {'table': 'metrox1', 'format': 'JSON', 'lang':'en'}
table_metadata = requests.post(
    'https://api.statbank.dk/v1' + '/tableinfo', json=params
).json()
print(str(table_metadata).replace(',', ',\n')[0:500] + '\n...')
{'id': 'METROX1',
 'text': 'Workday passenger index in the Copenhagen Metro (experimental statistics)',
 'description': 'Workday passenger index in the Copenhagen Metro (experimental statistics) by seasonal adjustment and time',
 'unit': 'Index',
 'suppressedDataValue': '0',
 'updated': '2022-06-16T08:00:00',
 'active': False,
 'contacts': [{'name': 'Peter Ottosen',
 'phone': '+4530429191',
 'mail': 'pot@dst.dk'}],
 'documentation': None,
 'footnote': {'text': 'Data are indexed against an averag
...

This wealth of information is already fantastic. That metadata contains pretty much everything you need to figure out whether you can actually use the table, and how you would eventually want to select the data (seasonally adjusted? For 2020 only?). Yet that JSON file might be tough to digest, especially for more complex tables. Those cases might require preprocessing and a different type of visualization. That's where the DstApi helper class comes into play.
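Before moving to the helper class, here is a minimal sketch of the kind of preprocessing I mean: flattening the variables section of the metadata (assuming, as the full JSON above contains, that each entry has an id and a list of values) into a small overview table.

# Sketch: flatten the 'variables' section of the tableinfo metadata into a DataFrame
# (each entry is assumed to have an 'id' and a list of {'id', 'text'} values)
variables_overview = pd.DataFrame(
    [
        {
            'variable': var['id'],
            'n_values': len(var['values']),
            'first_value': var['values'][0]['id'],
            'last_value': var['values'][-1]['id'],
        }
        for var in table_metadata['variables']
    ]
)
display(variables_overview)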

The easy way

DstApi has two methods for examining metadata.

The first one, tablesummary, summarizes the main metadata information:

  • The id and description of the table
  • The last update time
  • A table with the main available cuts of the data. Each row corresponds to a variable against which we can select, with examples of variable values and labels

Show the code
# Initialize the class with the target table
metro = DstApi('METROX1')

# Get the table summary
metro.tablesummary(language='en')
Table METROX1: Workday passenger index in the Copenhagen Metro (experimental statistics) by seasonal adjustment and time
Last update: 2022-06-16T08:00:00
| | variable name | # values | First value | First value label | Last value | Last value label | Time variable |
|---|---|---|---|---|---|---|---|
| 0 | SÆSON | 2 | 10 | Seasonally adjusted | 11 | Non-seasonally adjusted | False |
| 1 | Tid | 122 | 2020U01 | 2020U01 | 2022U23 | 2022U23 | True |

The second method, variable_levels, zooms into a specific variable and returns a dataframe listing its possible values. For example, we can check each value of SÆSON:

Show the code
metro.variable_levels('SÆSON', language='en')
| | id | text |
|---|---|---|
| 0 | 10 | Seasonally adjusted |
| 1 | 11 | Non-seasonally adjusted |

Now, for this simple table we already knew these values from tablesummary(), as there are only two. But for more complex tables, this method is very handy. Take for example DNVPDKR2, a table showing the outstanding amount of mortgage bonds issued by Danish mortgage institutes.

If, for example, I wanted to extract only data about fixed-rate, convertible bonds, it would be hard to know in advance that I should refer to the value FK.

But I can use

  • the method .tablesummary() to see which variables I can select on
  • the method .variable_levels() to see which values are available for each variable (when selecting data, * is a wildcard that selects all available values for a variable)
Show the code
dnrk = DstApi('DNVPDKR2')
dnrk.tablesummary(language='en')
Table DNVPDKR2: Danish mortgage bonds by type of mortgage bond, original maturity, remaining maturity, coupon (nominal interest rate), currency, issuer, investor sector, covered bonds, data type and time
Last update: 2024-11-28T08:00:00
| | variable name | # values | First value | First value label | Last value | Last value label | Time variable |
|---|---|---|---|---|---|---|---|
| 0 | TYPREAL | 9 | A0 | All mortgage bonds | O | 1.6 Other mortgage bonds | False |
| 1 | LØBETID3 | 7 | A0 | All original maturities | 6 | Other (unspecified) | False |
| 2 | LØBETID2 | 9 | A0 | All remaining maturities | 8 | Other (unspecified) | False |
| 3 | KUPON2 | 15 | A0 | All coupons | N | Other coupons | False |
| 4 | VALUTA | 6 | A0 | All currencies | O | Other | False |
| 5 | UDSTED | 10 | A0 | All issuers | O | Other issuers | False |
| 6 | INVSEKTOR | 10 | A0 | All sectors | U | 2. Foreign (S.2) | False |
| 7 | DAEKOBL | 4 | A0 | All mortgage bonds | RO | RO (other mortgage bonds) | False |
| 8 | DATAT | 5 | N1 | Stock - Nominal | B3 | Value adjustments - Market value | False |
| 9 | Tid | 299 | 1999M12 | 1999M12 | 2024M10 | 2024M10 | True |
Show the code
dnrk.variable_levels('TYPREAL', language='en')
| | id | text |
|---|---|---|
| 0 | A0 | All mortgage bonds |
| 1 | FK | 1.1 Fixed rate convertible bonds |
| 2 | FKU | - - 1.1.1 Open for issue - fixed rate converti... |
| 3 | FKE | - - 1.1.2 No longer open for issue - fixed rat... |
| 4 | I | 1.2 Indexed bonds |
| 5 | RTL | 1.3 Adjustable rate bonds (RTL bonds) |
| 6 | V | 1.4 Bonds with a reference rate (without inter... |
| 7 | VR | 1.5 Bonds with a reference rate (with interest... |
| 8 | O | 1.6 Other mortgage bonds |

Step 2: Get the data you need

The first step is essential for designing this second one. First and foremost, because we need that information to design the call to data. Second, to make sure we only extract the data we need. Asking for too much data only to then throw half of it away locally is wasteful, and ultimately disrespectful of the resources invested into allowing anyone to fire up an API call (I mean, how amazing is that?).

The hard way

As in the first step, we'll start by doing it manually. Here I'll rely exclusively on requests.post(), as recommended by DST.

To select the query parameters to pass to the data function appropriately, one ought to have a careful look at the DATA section of the documentation. Nonetheless, hopefully the examples below will clarify how to construct such calls.

The first two key parameters are, as before, the table name and the format in which we'd like to obtain the data. In the examples below I choose BULK, which has the advantage of being faster and of allowing an unlimited number of rows per export. This format has some limitations, such as the inability to perform simple computations (e.g. sums) on the fly. If you need those features, you probably don't need this guide, so I'll stick with BULK here.

The third crucial parameter is the variable selection, based on the variables shown in e.g. DstApi.tablesummary(). This part is mandatory: we need to specify the selection we want for each variable. We can, however, include a range of possible values, or all of them, in a selection. Here the character * acts as a wildcard: to select all values of a variable we can use *, and to select all 2020 weeks in Tid we could use 2020*.

Below I write the parameters necessary to download the seasonally adjusted (code 10) index for all weeks in the data, and pass them to requests.post(). Finally I print the first 200 characters of the data we received back (in ;-separated format).

params = {
    'table': 'metrox1',
    'format': 'BULK',
    'variables': [
        {'code': 'SÆSON', 'values': ['10']},
        {'code': 'Tid', 'values': ['*']}
    ]
}
r = requests.post('https://api.statbank.dk/v1' + '/data', json=params)
print(r.text[:200])
SÆSON;TID;INDHOLD
Sæsonkorrigeret;2020U01;37,7
Sæsonkorrigeret;2020U08;105,0
Sæsonkorrigeret;2020U09;95,2
Sæsonkorrigeret;2020U10;93,0
Sæsonkorrigeret;2020U11;63,0
Sæsonkorrigeret;2020U12;17,9

Neat! We can then save this data to a csv file or whatever, or directly import it into pandas:

Show the code
pd.read_table(StringIO(r.text), sep=';').head()
| | SÆSON | TID | INDHOLD |
|---|---|---|---|
| 0 | Sæsonkorrigeret | 2020U01 | 37,7 |
| 1 | Sæsonkorrigeret | 2020U08 | 105,0 |
| 2 | Sæsonkorrigeret | 2020U09 | 95,2 |
| 3 | Sæsonkorrigeret | 2020U10 | 93,0 |
| 4 | Sæsonkorrigeret | 2020U11 | 63,0 |
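One small gotcha: in the Danish-language export the decimal separator is a comma, so INDHOLD above is read as text rather than as numbers. Here is a minimal sketch of a fix, using pandas' standard decimal option, which also saves the raw response for later reuse (the filename is just an example):

# Save the raw ;-separated response to disk for later reuse (example filename)
with open('metrox1_seasonally_adjusted.csv', 'w', encoding='utf-8') as f:
    f.write(r.text)

# Tell pandas that ',' is the decimal separator, so INDHOLD is parsed as a float
df_sa = pd.read_table(StringIO(r.text), sep=';', decimal=',')
print(df_sa.dtypes)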

Keep in mind that you can also specify intervals for time variables, as in the example below, where I also require the data to be exported in English.

Show the code
params = {
    'table': 'metrox1',
    'format': 'BULK',
    'lang': 'en',
    'variables': [
        {'code': 'SÆSON', 'values': ['11']},
        {'code': 'Tid', 'values': ['>2020U45<=2020U52']}
    ]
}
df = pd.read_csv(
    StringIO(
        requests.post('https://api.statbank.dk/v1' + '/data', json=params).text
    ), sep=';'
)
df.head()
| | SÆSON | TID | INDHOLD |
|---|---|---|---|
| 0 | Non-seasonally adjusted | 2020U46 | 56.2 |
| 1 | Non-seasonally adjusted | 2020U47 | 55.5 |
| 2 | Non-seasonally adjusted | 2020U48 | 58.3 |
| 3 | Non-seasonally adjusted | 2020U49 | 57.6 |
| 4 | Non-seasonally adjusted | 2020U50 | 46.9 |

The easy-er way

The code above is already quite compact, but to avoid having to remember how to import the data into pandas every time, DstApi has a method that imports the data directly into pandas given a parameter dictionary. So, for example, given the first params dictionary defined above (the seasonally adjusted index for all weeks), we can call directly

Show the code
metro.get_data(params=params)
| | SÆSON | TID | INDHOLD |
|---|---|---|---|
| 0 | Sæsonkorrigeret | 2020U01 | 37.7 |
| 1 | Sæsonkorrigeret | 2020U08 | 105.0 |
| 2 | Sæsonkorrigeret | 2020U09 | 95.2 |
| 3 | Sæsonkorrigeret | 2020U10 | 93.0 |
| 4 | Sæsonkorrigeret | 2020U11 | 63.0 |
| ... | ... | ... | ... |
| 117 | Sæsonkorrigeret | 2022U19 | 99.5 |
| 118 | Sæsonkorrigeret | 2022U20 | 95.7 |
| 119 | Sæsonkorrigeret | 2022U21 | 103.1 |
| 120 | Sæsonkorrigeret | 2022U22 | 108.4 |
| 121 | Sæsonkorrigeret | 2022U23 | 109.3 |

122 rows × 3 columns

The .get_data() method also has the built-in option of downloading an entire data table when no parameter dictionary is passed. As mentioned above, this can be (very) wasteful: some DST tables contain billions of data points. That's why, when used in this way, the method asks for explicit confirmation before proceeding.
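For completeness, a minimal sketch of that usage (harmless for METROX1, which is a small table; the confirmation prompt is interactive):

# Download the entire (small) METROX1 table; the method asks for confirmation first
full_metro = metro.get_data()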

However, creating the params dictionary itself can be challenging. As we saw above with table DNVPDKR2, table structures can be complex, and writing the parameter dictionary by hand can be cumbersome.

That’s why DstApi has a helper method returning a base dictionary of parameters with default values.

Only pass a default params dictionary to .get_data() if you know what you are doing

Some tables in Statistikbanken have millions of records. Downloading them all through the API can take a lot of time, and it's extremely wasteful if you actually need only a fraction of the data.

Show the code
# Start by constructing a basic dictionary
dnrk = DstApi('DNVPDKR2')
params = dnrk.define_base_params(language = 'en')
params
{'table': 'dnvpdkr2',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'TYPREAL', 'values': ['*']},
  {'code': 'LØBETID3', 'values': ['*']},
  {'code': 'LØBETID2', 'values': ['*']},
  {'code': 'KUPON2', 'values': ['*']},
  {'code': 'VALUTA', 'values': ['*']},
  {'code': 'UDSTED', 'values': ['*']},
  {'code': 'INVSEKTOR', 'values': ['*']},
  {'code': 'DAEKOBL', 'values': ['*']},
  {'code': 'DATAT', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

Once I have the basic structure, I can copy-paste the dictionary definition and use the method variable_levels to refine the selection further. For example, I would like to keep only 30-year bonds, and the output below tells me to use the value 5 for the variable LØBETID3 (likewise, under UDSTED the value RD selects bonds issued by Realkredit Danmark).

Show the code
dnrk.variable_levels('LØBETID3', language='en')
| | id | text |
|---|---|---|
| 0 | A0 | Alle oprindelige løbetider |
| 1 | 1 | Under 10 år |
| 2 | 2 | 10-årige |
| 3 | 3 | 15-årige |
| 4 | 4 | 20-årige |
| 5 | 5 | 30-årige |
| 6 | 6 | Andet (uspecificeret) |

I can then refine my query by filling in the required selection parameters, and call the get_data() method to extract the final dataframe.

Show the code
params = {'table': 'dnvpdkr2',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'TYPREAL', 'values': ['FK']},
  {'code': 'LØBETID3', 'values': ['5']},
  {'code': 'LØBETID2', 'values': ['A0']},
  {'code': 'KUPON2', 'values': ['A0']},
  {'code': 'VALUTA', 'values': ['DKK']},
  {'code': 'UDSTED', 'values': ['RD']},
  {'code': 'INVSEKTOR', 'values': ['A0']},
  {'code': 'DAEKOBL', 'values': ['A0']},
  {'code': 'DATAT', 'values': ['N1']},
  {'code': 'Tid', 'values': ['*']}]}
df = dnrk.get_data(params=params, language='en')
df.tail()[["TID", "INDHOLD"]]
| | TID | INDHOLD |
|---|---|---|
| 294 | 2024M06 | 229104 |
| 295 | 2024M07 | 228410 |
| 296 | 2024M08 | 229411 |
| 297 | 2024M09 | 231149 |
| 298 | 2024M10 | 231694 |

And just like that, I have the full time series of RD's 30-year fixed-rate bonds at nominal value.
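From here it is straightforward to work with the series. For example, a minimal sketch that parses DST's monthly period format into proper dates and plots the series (assuming matplotlib is installed, since pandas' .plot() relies on it):

# Parse DST's monthly period format (e.g. '2024M10') into datetimes and plot the series
df['date'] = pd.to_datetime(df['TID'], format='%YM%m')
df.set_index('date')['INDHOLD'].plot(title='RD 30-year fixed-rate bonds, nominal value')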

You can play around with parameters in various ways. For example, here I select a range of weeks in 2020 from the metro table.

Show the code
# Start by constructing a basic dictionary
params = metro.define_base_params(language = 'en')
params['variables'][0]['values'] = ['10']
params['variables'][1]['values'] = ['>2020U45<=2020U52']
metro.get_data(params=params)
| | SÆSON | TID | INDHOLD |
|---|---|---|---|
| 0 | Seasonally adjusted | 2020U46 | 59.0 |
| 1 | Seasonally adjusted | 2020U47 | 54.0 |
| 2 | Seasonally adjusted | 2020U48 | 56.2 |
| 3 | Seasonally adjusted | 2020U49 | 54.5 |
| 4 | Seasonally adjusted | 2020U50 | 44.3 |
| 5 | Seasonally adjusted | 2020U51 | 40.2 |
| 6 | Seasonally adjusted | 2020U52 | 44.8 |

And that’s it!

I hope this guide was useful, and that the DstApi class can prove as helpful to you as it is for me.

Once again, let me conclude with a shout-out to Statistics Denmark, a real national treasure. Thanks for all your work gathering, organizing, and publishing data for everyone to use. It's a fantastic service, and one for which you'll never be thanked enough.